Abstract

In this paper I explain the reasons that
led me to research and conceive a novel
technology for dependency parsing that combines
the strengths of data-driven transition-based
and constraint-based approaches. In particular
I highlight the problem of inferring the
reliability of the results of a data-driven
transition-based parser, which is extremely
important for high-level processes that expect
to use correct parsing results. I then briefly
introduce a new parser model I’m working on,
capable of proceeding with the analysis in a
“more aware” way, with a more “robust” concept
of robustness.

1 Introduction 

To ease the reading of this article I have decided
to adopt a general descriptive approach as much as
possible to present my research as well as its
underlying motivations, rather than a more specific
and formal one that will be used in subsequent papers.

The structure of this paper is as follows: I
start with background information on Natural
Language in section 2 and then on Dependency
Parsing in section 3. In sections 4 and 5 I focus on
two approaches that fulfil the dependency parsing
task, respectively the data-driven transition-based
approach and the constraint-based approach. Section 6
explains the motivations for a new approach. The
final section 7 briefly introduces a new hybrid
parser model I’m working on that proceeds with
the analysis in a “more aware” way, whose
meaning is explained in the next sections.

2 Natural Language 

Natural Language is a very complex system,
involving many brain processes. Trying to
reproduce it by means of artificial agents has been one
of the main goals of Artificial Intelligence since its
early days. For more than 50 years, linguists and
computer scientists have tried to make computers
understand human language, fighting against its
fascinating and misleading nature: implicit, highly
contextual, ambiguous, often imprecise and
contingent on biological processes. Indeed,
language appears subject to the whims of evolution
and cultural change on the one hand, and based on
strong rules that constrain the possible sequences
of phonemes or words on the other.[1]

As a matter of fact it is hard to deny that
linguistic production and comprehension are based
on a system of formal regularities, and that some
of these regularities have a stable behaviour at a
given moment of the historical evolution of a given
language. Furthermore, these regularities are
precisely what legitimates the use of the term “system”
when one talks about language. These ideas are
well explained by a quote from De Mauro (1990),
of which a free translation follows: “On one hand
there are, well founded, the reasons that
assimilate the language to a calculation. On the other
hand, no less strong, there are the reasons that
preclude such assimilation. Theorists and
philosophers generally have opted for accentuation of one or
the other. (...) It seems to us that a good theory of
the language must take into account those and the
other reasons”.

This calculation vs. non-calculation dilemma
is the reason why - to name but one - Chomsky
(1965) distinguishes between competence system
and performance system, giving rise to principles
such as grammaticality and acceptability of a
sentence as well as the distinction between its surface and
deep structure. While these issues attract a lot of
attention from a theoretical point of view, they are
not so relevant for the realisation of a practical

[1] For example, in written language the sequence Article +
Article (the the cat) is meaningless if not unacceptable.



system for natural language analysis: what is
important is to acknowledge that gradation is a
central phenomenon in natural language, which means
that not all sentences fit into the binary
distinction of grammatical vs. ungrammatical; some are
simply “slightly better” than others without the
latter having to be completely rejected.

From a computational perspective I can say
that this dichotomy was the root of the “theory
split” that still today characterizes most of the
studies and the research on Natural Language
Processing (NLP), which is split into two branches
that for simplicity I call the rule-based and the
statistical approaches.[2]

3 Dependency Parsing 

Dependency Parsing is an attractive alternative to
constituency parsing for syntactic analysis,
commonly considered one of the fundamental steps
of linguistic processing because of its key
importance in mediating between linguistic expression
and meaning.

The theory behind dependency parsing is based
on Dependency Grammar, which boasts
a long-standing tradition in linguistics:
several theories and formalisms (Tesnière (1959),
Sgall (1986), Mel’čuk (1988), Hudson (1990),
Maruyama (1990)) share the fundamental
assumption that syntactic structure consists of
word-to-word dependencies, i.e. lexical nodes linked by
binary asymmetrical relations called
dependencies. Dependency Grammar is sometimes called
Valency Grammar, a name conceived by analogy
with chemical valency, according to which some
words (especially verbs) have valencies
dependent on the number of elements (e.g. nouns) with
which they combine.[3]

In a dependency structure, every word is
dependent on at most one other word (its governor).[4]
This means that the structure can be represented
as a dependency tree, where nodes are words and
arcs are dependency relations (e.g. subject,
direct object, modifier). Another requirement for a
well-formed dependency tree is that there is
precisely one root, which is usually the main verb
of the sentence (as a consequence of the

[2] Here the term statistical is improperly used because we
also include probabilistic approaches.

[3] Because of this analogy, lexical nodes are sometimes
called “atoms”.

[4] Alternative terms in the literature are regent and head for
governor, and modifier or argument for dependent.

“verbo-centricity” theory). Thus, the task of a dependency
parser is to take a sentence (input text) represented
by a sequence of words (nodes) and enrich it with
the appropriate set of labeled dependency arcs.
Each labeled dependency arc involves exactly two
words and a label.[5] More formally, dependencies
can be represented as a set of directed arcs of the
form $g \xrightarrow{l} d$, where g is the governor node, d is the
dependent node ($g \neq d$) and l is the label, resulting
in a dependency structure called dependency tree
(parse tree). For more details on dependency trees,
dependency grammar and dependency parsing see
Nivre (2003) and the references cited therein.
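
The well-formedness requirements just listed can be made concrete with a small sketch. The fragment below is only an illustration under stated assumptions: the `Arc` tuple, the function name `is_well_formed` and the convention of an artificial root node with index 0 (standing in for the missing governor of the root word) are hypothetical choices made for this example, not part of any of the cited formalisms.

```python
from typing import Iterable, NamedTuple


class Arc(NamedTuple):
    governor: int   # index of the governor; 0 is an assumed artificial root
    dependent: int  # index of the dependent word (words are numbered from 1)
    label: str      # name of the dependency relation


def is_well_formed(n_words: int, arcs: Iterable[Arc]) -> bool:
    """Check that the arcs over words 1..n_words form a single-rooted tree."""
    heads = {}
    for arc in arcs:
        if not (1 <= arc.dependent <= n_words and 0 <= arc.governor <= n_words):
            return False
        if arc.dependent in heads:          # a word with two governors
            return False
        heads[arc.dependent] = arc.governor
    if len(heads) != n_words:               # some word has no governor at all
        return False
    if sum(1 for g in heads.values() if g == 0) != 1:   # precisely one root
        return False
    for word in range(1, n_words + 1):      # every word must reach the root
        seen, node = set(), word
        while node != 0:
            if node in seen:                # a cycle (covers g == d as well)
                return False
            seen.add(node)
            node = heads[node]
    return True


# "the cat sleeps": the <- cat <- sleeps <- ROOT
tree = {Arc(2, 1, "det"), Arc(3, 2, "subj"), Arc(0, 3, "root")}
cyclic = {Arc(2, 1, "det"), Arc(1, 2, "subj"), Arc(0, 3, "root")}
print(is_well_formed(3, tree), is_well_formed(3, cyclic))
```

Representing the root through an artificial node keeps the "one incoming arc per word" check uniform; a structure with hidden words, as discussed next, would need a richer node inventory than this sketch assumes.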

As we will see later, this is an
over-simplification, if nothing else because not all the
words that should be considered are expressed in
the input text: there are “words” hidden in the
surface structure but essential to keep up a
syntactic structure. This happens because
understanding a sentence means translating the linear
order, which arises mainly for physical reasons (and
in which some elements may be lost), into a
structural order, bringing back its original hierarchical
structure (Tesnière, 1959).[6] I’d like to think that if
we could communicate “telepathically”, without
physical constraints (linear sequence of words),
then the information would be transferred directly
from one brain to the other as a structure similar to
a dependency tree, with all the elements naturally
hierarchically organized.

Back to the main subject, several approaches
have been developed to fulfil the dependency
parsing task. In this paper I’ll focus on the data-driven
transition-based approach and on one of its
“counterparts”, wcdg-parsing, based on the Weighted
Constraint Dependency Grammar (WCDG), which
is grounded on a specific descendant of the
constraint-based approach (Heinecke et al., 1998;
Schröder, 2002). These two approaches are
representative of the statistical and rule-based worlds
respectively, and both of them achieve similar
overall state-of-the-art results. See Krivanek and
Meurers (2011) for a comparison of a statistical
and a rule-based dependency parser.

An overview of the two approaches will be given
in the following sections, introducing the key
concepts needed to understand the motivations for a new
parser and the technology that could be adopted

[5] The only exception is the root node, which can have a
label but not a true governor.

[6] Time-linearity in spoken language and space-linearity in
written language.




for its implementation. 

4 Data-driven transition-based approach 

I assume the reader is familiar with the formal
framework of transition-based dependency
parsing originally introduced by Nivre (2003).

To summarize, transition-based parsing is based
on a transition system that processes the input
sentence by means of transitions which incrementally
build the dependency tree. The sequence of
transitions is called a computation. The system is
initialized to an initial configuration based on the
input sentence, to which transitions are applied
repeatedly, generating new configurations until a
final configuration is reached. Given a
configuration, transitions can create a dependency arc
between governor and dependent nodes. A
transition that creates an arc encodes several pieces of
information, including the arc direction (left vs. right),
from which it is possible to identify the nodes
involved, and a label representing the name of the
syntactic relation.
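
As an illustration of these notions, the following sketch implements a tiny transition system. It assumes the stack-based arc-standard variant, which is only one of several possible systems, and every name in it (`Configuration`, `initial_configuration`, `apply_transition`, the artificial root with index 0) is a hypothetical choice for this example rather than the interface of any existing parser.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    stack: list      # indices of partially processed words (top is last)
    buffer: list     # indices of the words still to be read
    arcs: set = field(default_factory=set)   # (governor, label, dependent)

    def is_final(self):
        # Assumed final configuration: empty buffer, only the root on the stack.
        return not self.buffer and self.stack == [0]


def initial_configuration(n_words):
    # Index 0 is an assumed artificial root; real words are numbered 1..n_words.
    return Configuration(stack=[0], buffer=list(range(1, n_words + 1)))


def apply_transition(config, name, label=None):
    """Apply one transition in place and return the resulting configuration."""
    if name == "shift":
        if not config.buffer:
            raise ValueError("shift needs a non-empty buffer")
        config.stack.append(config.buffer.pop(0))
    elif name == "left_arc":     # the top of the stack governs the word below it
        if len(config.stack) < 2 or config.stack[-2] == 0:
            raise ValueError("left_arc precondition not met")
        dependent = config.stack.pop(-2)
        config.arcs.add((config.stack[-1], label, dependent))
    elif name == "right_arc":    # the word below the top governs the top
        if len(config.stack) < 2:
            raise ValueError("right_arc precondition not met")
        dependent = config.stack.pop()
        config.arcs.add((config.stack[-1], label, dependent))
    else:
        raise ValueError(f"unknown transition: {name}")
    return config


# A computation for "the cat sleeps" (words 1, 2, 3).
config = initial_configuration(3)
computation = [("shift", None), ("shift", None), ("left_arc", "det"),
               ("shift", None), ("left_arc", "subj"), ("right_arc", "root")]
for name, label in computation:
    apply_transition(config, name, label)
print(sorted(config.arcs))
```

Note how each arc-creating transition carries both a direction and a label, which is all that is needed to recover the governor and the dependent from the current configuration.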

Data-driven means that there is no need for a
hand-written grammar and the analysis process is
guided by the words (the data) of the sentence
itself. This approach takes advantage of the
increased availability of dependency treebanks (i.e.
sentences manually annotated in parse tree format)
and of recent techniques for applying machine
learning algorithms to natural language processing
to implement a training procedure able to generate
an “inductive grammar”.

During the learning phase, the treebank
sentences are processed and the parser learns, by
means of a classifier, how to use the transitions by
emulating the transition sequence produced by the
oracle function, so that learning a grammar
means learning to select the best next
transition given a configuration. Oracle is the
name given to the function that maps parser
configurations to optimal transitions with respect to an
annotated sentence (gold tree). Given a
configuration/transition pair, a set of features summarizing
the configuration is extracted from it and used to
train the classifier.
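
To show what such an oracle may look like, here is a minimal sketch of a static oracle. It assumes the stack-based arc-standard system and a projective gold tree; the function name `oracle_pairs` and the encoding of the gold tree as two lists indexed by word (with a dummy entry at index 0 for an artificial root) are assumptions of this example only.

```python
def oracle_pairs(heads, labels):
    """Return (configuration, transition) training pairs for one gold tree.

    heads[i] and labels[i] are the gold governor and label of word i;
    index 0 is a dummy entry for the assumed artificial root.
    """
    n = len(heads) - 1
    stack, buffer = [0], list(range(1, n + 1))
    attached = set()   # words whose incoming arc has already been built
    pairs = []
    while buffer or len(stack) > 1:
        transition = None
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if below != 0 and heads[below] == top:
                transition = ("left_arc", labels[below])
            elif heads[top] == below and all(
                d in attached for d in range(1, n + 1) if heads[d] == top
            ):
                # Attach the top only once all of its own dependents are done.
                transition = ("right_arc", labels[top])
        if transition is None:
            if not buffer:
                raise ValueError("gold tree is not projective")
            transition = ("shift", None)
        pairs.append(((tuple(stack), tuple(buffer)), transition))
        if transition[0] == "shift":
            stack.append(buffer.pop(0))
        elif transition[0] == "left_arc":
            attached.add(stack.pop(-2))
        else:
            attached.add(stack.pop())
    return pairs


# Gold tree of "the cat sleeps": the <- cat <- sleeps <- ROOT
pairs = oracle_pairs(heads=[0, 2, 3, 0], labels=[None, "det", "subj", "root"])
for configuration, transition in pairs:
    print(configuration, transition)
```

Each pair would then be turned into a feature vector and a target class for the classifier; the feature extraction itself is outside the scope of this sketch.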

Features are one of the most important types of
information that data-driven systems use to guide
the analysis process. After the training phase
(during which the transitions are established in advance
by the oracle), these features must be sufficiently
robust to guide the parsing process without a
complete view of the sentence. Data-driven
transition-based parsers use only contextual information, i.e.
a limited window of nodes, including the already
parsed ones, centred around the focus point of the
analysis. Features that take these elements into
account can encode combinations of several word
properties, e.g. forms, pos-tags, and arc labels of the
word itself or of its left and right dependents.
Zhang and Nivre (2011) proposed a rich feature
template that considers third-order features, the linear
distance between a possible governor and
dependent pair, and valency information.
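
A toy version of such a feature template may help to fix ideas. The sketch below is deliberately much poorer than the templates cited above, and the function name `extract_features`, the feature naming scheme and the window size are illustrative assumptions.

```python
def extract_features(stack, buffer, forms, tags, window=2):
    """Describe a configuration through a small window around the focus point."""
    features = []
    stack_nodes = stack[::-1][:window]    # top of the stack first
    buffer_nodes = buffer[:window]        # next input word first
    for prefix, nodes in (("s", stack_nodes), ("b", buffer_nodes)):
        for i, node in enumerate(nodes):
            features.append(f"{prefix}{i}.form={forms[node]}")
            features.append(f"{prefix}{i}.pos={tags[node]}")
    if stack_nodes and buffer_nodes:
        s0, b0 = stack_nodes[0], buffer_nodes[0]
        # A combined feature and the linear distance of a candidate pair.
        features.append(f"s0.pos+b0.pos={tags[s0]}+{tags[b0]}")
        features.append(f"dist(s0,b0)={b0 - s0}")
    return features


forms = ["<root>", "the", "cat", "sleeps"]
tags = ["ROOT", "DET", "NOUN", "VERB"]
features = extract_features(stack=[0, 1], buffer=[2, 3], forms=forms, tags=tags)
print(features)
```

Everything outside the window is invisible to the classifier, which is exactly the locality that the rest of this section discusses.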

This approach is represented by the models
of Yamada and Matsumoto (2003), Nivre (2003,
2004), Attardi (2006), Nivre (2009), Goldberg and
Nivre (2013), Sartorio (2013).[7] These models
mainly differ in the way they define
configurations, i.e. the set of available transitions, as well as
in the ability to handle discontinuous syntactic
constructions.[8]

As a side effect of their incremental behaviour,
data-driven transition-based parsers have limited
look-ahead capabilities (i.e. they are limited to local
features), so they sometimes have to decide too
early how to proceed with the analysis, before
having seen the remaining part of the sentence, and
are therefore likely to make mistakes, especially on
long-distance dependencies (Bohnet, 2011). These
search errors cause error propagation (McDonald
and Nivre, 2007): when the parser makes an error,
the probability that it makes others increases,
because it enters configurations for which it has not
been trained and which it does not know how to
handle (Goldberg, 2013). The most common remedy
is to use a beam search algorithm instead of
exploring only a single derivation for each input
(greedy decoding). The drawback of beam search is
that parsing is not as fast as with the original
greedy transition-based parsers.
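
The difference between greedy decoding and beam search can be seen on a toy search space that has nothing to do with parsing. In the sketch below the function `decode`, the table `STEP_SCORES` and the scores themselves are invented for illustration; states are just strings built one symbol at a time.

```python
def decode(start, successors, is_final, beam_size):
    """Keep the beam_size best partial derivations; beam_size=1 is greedy."""
    beam = [(0.0, start)]
    while not all(is_final(state) for _, state in beam):
        candidates = []
        for score, state in beam:
            if is_final(state):
                candidates.append((score, state))
                continue
            for step_score, next_state in successors(state):
                candidates.append((score + step_score, next_state))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


# Invented scores: the locally best first step ("a") leads to a poor ending.
STEP_SCORES = {
    "": {"a": 1.0, "b": 0.9},
    "a": {"a": 0.2, "b": 0.1},
    "b": {"a": 1.0, "b": 0.0},
}


def successors(state):
    return [(score, state + symbol) for symbol, score in STEP_SCORES[state].items()]


def is_final(state):
    return len(state) == 2


greedy = decode("", successors, is_final, beam_size=1)
beam = decode("", successors, is_final, beam_size=2)
print(greedy, beam)
```

The price is visible in the loop: at every step the beam variant scores the successors of several states instead of one, which is the slowdown mentioned above.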

Recently, several changes have been made to
the original approach in order to mitigate
this error propagation problem without altering
parsing time. The most prominent approach is to
use dynamic oracles in combination with
online-learning techniques, enabling an error-exploration
procedure that improves the way in which

[7] The arc-standard and arc-eager models are two of the
most widely known and used transition-based systems.

[8] Discontinuous syntactic constructions are also called
non-projective dependency structures because of the presence
of crossing edges.

classifiers learn from data. In essence, error exploration
consists in exposing the training procedure (and hence
the classifier) to non-optimal configurations
(computations that do not lead to a gold tree),
obtained by sometimes following erroneous predicted
transitions, together with the optimal transitions
for those configurations, providing the parser with
a sort of self-consciousness of its own mistakes.
Parsers that exploit these flexible oracles achieve
state-of-the-art results for greedy parsing, with a
notable gain in accuracy compared to
static oracles (typically 1-2%) and no difference
in parsing time, obtaining scores comparable to
those of the best statistical graph-based parsers
(McDonald et al., 2005).
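
The core of error exploration is the rule that decides which transition the training procedure follows after each prediction. The sketch below shows one plausible form of that rule; the function name `choose_next_transition` and the parameters `k` (number of initial iterations without exploration) and `p` (probability of falling back to an optimal transition) are assumptions of this example, and the set of optimal transitions is taken as given rather than computed by a real dynamic oracle.

```python
import random


def choose_next_transition(predicted, optimal, iteration, k=2, p=0.1,
                           rand=random.random):
    """Pick the transition to follow during training.

    predicted: transition chosen by the current classifier
    optimal:   transitions the oracle considers optimal here, best first
    """
    if predicted in optimal:
        return predicted
    if iteration > k and rand() > p:
        # Explore: follow the erroneous prediction, so that later training
        # steps see the non-optimal configurations it leads to.
        return predicted
    return optimal[0]


# During the first k iterations the procedure behaves like a static oracle.
print(choose_next_transition("right_arc", ["shift"], iteration=1))
# Afterwards it mostly follows its own (possibly wrong) predictions.
print(choose_next_transition("right_arc", ["shift"], iteration=3,
                             rand=lambda: 0.5))
```

What makes the oracle "dynamic" is that `optimal` must be defined for every configuration, including those reached after a mistake; that computation depends on the transition system and is not shown here.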

These parsers usually work monotonically, since
arcs are only added to but never removed from
the set of dependencies. Honnibal (2013)
suggests an “error-repairing” strategy, implemented
as a non-monotonic version of the arc-eager
system, which combines the error-exploring technique
with some relaxing of the transition
preconditions, allowing the parser to recover the correct arc
from the wrong governor assignments forced by
past incorrect transitions. Although the idea of
error-repairing is very interesting, its real recovery
capabilities are very limited for now, providing
an improvement of up to 0.2% accuracy
(Honnibal, 2013).

5 Constraint-based approach 

In this section I’ll focus on wcdg-parser (Foth and
Menzel, 2006), a mature implementation of
wcdg-parsing based on the WCDG grammar. WCDG
extends the CDG formalism first described by
Maruyama (1990), and it was demonstrated to be
appropriate for modelling a large variety of
linguistic phenomena such as immediate dominance,
agreement, valence, aspects of word order and
projectivity.

In this approach the parsing problem can be
viewed as a constraint satisfaction problem (CSP).
An introduction to CSP can be found in
Miguel and Shen (2001). The main dependency
parsing process is seen as the problem of finding
a dependency tree for a sentence that satisfies the
constraints defined by a hand-written grammar.

The main aspect of WCDG concerns the
possibility to express graded constraints rather than hard
grammar rules: each constraint is assigned a
weight or penalty between 0.0 and 1.0 that
indicates its importance. The weight 0.0 is associated
with hard constraints, which theoretically can only be
violated when no other solution is possible, while
other weights (soft constraints) assign
preferences among many linguistic phenomena.
Furthermore, the formalism of WCDG provides
dynamic constraints which do not have a static score
but receive different weights depending on the
context in which they are evaluated. Usually the
constraints and their related weights are determined
by the grammar writer. Recent works attempted to
compute the weights of a WCDG automatically by
observing which weight vectors perform best on a
given corpus (Schröder et al., 2001), but weights
computed completely automatically failed to
improve on the original hand-coded grammar.

For instance, constraints can express that:

• preferably the top node is a verb (soft
constraint);

• preferably the top node is a finite verb (soft);

• a node does not have more than one object
(hard constraint);[9]

• determiners must precede their governor
(hard) and that it is most often a noun (soft);
or that there cannot be two determiners for
the same governor (hard); or that a
determiner and its governor must agree in number
and gender (soft);

• an article modifies a nearby noun (dynamic
constraint).
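
To illustrate how graded constraints can rank analyses, the sketch below encodes three of the constraints listed above and scores an analysis. It rests on assumptions that should be stated clearly: the overall score is taken to be the product of the weights of the violated constraints, constraints are checked once per analysis rather than once per arc, and the weight 0.1 as well as all names (`TAGS`, `CONSTRAINTS`, `evaluate`) are invented for the example.

```python
from math import prod

# An analysis maps each word index to (governor index, label); 0 is the root.
TAGS = {1: "DET", 2: "NOUN", 3: "VERB"}   # "the cat sleeps"


def root_is_verb(analysis):
    return all(TAGS[w] == "VERB" for w, (g, _) in analysis.items() if g == 0)


def at_most_one_object(analysis):
    governors = [g for g, label in analysis.values() if label == "obj"]
    return len(governors) == len(set(governors))


def determiner_precedes_governor(analysis):
    return all(w < g for w, (g, _) in analysis.items() if TAGS[w] == "DET")


# (description, weight, check): weight 0.0 marks a hard constraint,
# the soft weight 0.1 is an arbitrary illustrative value.
CONSTRAINTS = [
    ("the top node is a verb", 0.1, root_is_verb),
    ("no more than one object per node", 0.0, at_most_one_object),
    ("determiners precede their governor", 0.0, determiner_precedes_governor),
]


def evaluate(analysis, constraints=CONSTRAINTS):
    """Return the score of an analysis and the constraints it violates."""
    violated = [(name, weight) for name, weight, holds in constraints
                if not holds(analysis)]
    return prod((weight for _, weight in violated), start=1.0), violated


plausible = {1: (2, "det"), 2: (3, "subj"), 3: (0, "root")}
noun_root = {1: (2, "det"), 2: (0, "root"), 3: (2, "mod")}
two_objects = {1: (3, "obj"), 2: (3, "obj"), 3: (0, "root")}
for analysis in (plausible, noun_root, two_objects):
    print(evaluate(analysis))
```

Under these assumptions an analysis that violates nothing scores 1.0, a soft violation lowers the score without excluding the analysis, and any hard violation drives the score to 0.0. The list of violated constraints is as informative as the score itself, a point taken up again in section 6.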

Besides constraints, wcdg-parser uses an
information-rich (e.g. valence) hand-crafted
lexicon (Foth, 2006). Further details can be found
in Foth (2004) and Foth and Menzel (2006).

Since the general CSP is an NP-complete
problem, wcdg-parsing too can run into
non-termination and efficiency problems.[10] Instead of
a full search, wcdg-parser uses a heuristic search

[9] A special dependency label, called extra-obj, could be
used in order to allow ripresa pronominale due to
dislocazione a sinistra (pronominal resumption due to left
dislocation), terms borrowed from the Italian language,
where this type of syntactic construction is very frequent.

[10] Indeed, some solutions of language processing
algorithms that would be ideal in theory have a complexity that
corresponds to the NP-complete problems: a trade-off exists
between the solutions that are theoretically most elegant and
the solutions that can be implemented in practice.



called Frobbing, a non-monotonic
transformation-based constraint resolution method with anytime
properties (Foth et al., 2000).[11]

Wcdg-parser tries to find an analysis (a
dependency tree) by transforming a given one until it
cannot be improved further. In Frobbing, an arbitrary
dependency structure is constructed first from the
input sentence, then the algorithm tries to correct
analysis errors by selecting transformations based on
the constraints that cause conflicts (constraints that
are violated for specific dependency arcs). Given
an analysis, transformations generate new analyses by
changing a set of local properties such as the
label or governor of a dependency relation, as well
as the pos tag of a node or its morphological
features (e.g. case, gender, number, mood, tense,
etc.). A set of conflicts is then recomputed and
the most severe of them (weight close to 0.0) is
attacked by transforming the analysis. This
results in an analysis which is not necessarily better
(i.e. with less severe conflicts, since other
conflicts may be created in this step) but that does
not have that specific attacked conflict anymore.
If the conflict cannot be removed, the algorithm
backtracks to the last starting analysis, resulting
in a search strategy similar to tabu search. The
whole process is repeated until a new, better
analysis is found and marked as the new starting
analysis. The algorithm ends when no further
improvements of the analysis are possible.

The constraint-based approach is especially
useful for richly inflected languages with free
word order such as, for example, Italian and
German which, according to recent experiments,
have a syntax considerably more difficult to
analyse than English. Dubey and Keller (2003) and
Grella (2011) show better results for
constraint-based parsers with respect to statistical parsers,
respectively, in an evaluation on the NEGRA treebank
(Brants et al., 1999) for German and the TUT
Treebank for Italian (Bosco et al., 2000).

6 Motivations for a new approach 

Nowadays, and even more in the future, an
effective syntactic parser is involved in any
cutting-edge application devoted to text processing
(document indexing, information extraction, automatic

[11] Anytime property: the parser maintains a complete
analysis at any time, so the algorithm can be stopped at any time
and return a complete analysis. However, there is a trade-off
between parsing time and quality of results: the more time is
allowed for an analysis, the better its accuracy generally is.


translation, sentiment analysis, ...), so the parser
must be robust, i.e. able to produce an analysis for
any type of input, including non-canonical
material such as spoken language transcriptions or
e-mails, which may include syntactically ill-formed
sentences. Another crucial feature of syntactic
parsers is efficiency, given the increased
availability of data (Big Data) to analyse.

The syntactic parser output is the starting point
of other high-level processes which expect to use
correct parsing results, so it is extremely
important to be able to predict the reliability of the
results of a parser. Otherwise, when incorrect parsing
information is used, a degradation of the application’s
performance is almost guaranteed.

Among the advantages of data-driven
transition-based parsers are their state-of-the-art
accuracy and the linear time complexity of many
of them. Greedy parsers are the fastest approach
to dependency parsing, enabling web-scale
parsing with high throughput.

This parsing approach seems appealing not only
from an engineering perspective, due to its
efficiency, but also from a psycholinguistic point
of view, as it processes a sentence incrementally
much the way that people do, which has
motivated several studies concerning its cognitive
plausibility (Nivre, 2004; Boston and Hale, 2007;
Boston et al., 2008).

From a cognitive perspective, the data-driven
statistical nature puts this approach in line with Bod
(2003):

Language displays all the hallmarks of
a probabilistic system. Grammaticality
judgments and linguistic universals are
probabilistic and stochastic grammars
enhance learning. All evidence points to
a probabilistic language faculty.

and with Norvig (2011): 

It seems clear that probabilistic models
are better for judging the likelihood
of a sentence, or its degree of sensibility.
But even if you are not interested in
these factors and are only interested in
the grammaticality of sentences, it still
seems that probabilistic models do a better
job at describing the linguistic facts.

Unfortunately, it is difficult to infer the
reliability of the results of a data-driven transition-based
parser, and this assumption does not fit well with
actual implementations of statistical dependency
parsing: there is no straightforward mapping
between the parser output score, if any, and some
simple notion of grammaticality.[12] On the
contrary, Fong and Berwick (2009) found that, despite
their results, such parsers fail to incorporate much
“knowledge of language” in many cases: they fail
to replicate many empirically attested
grammaticality judgments; seem overly sensitive, rather than
robust, to training data idiosyncrasies; and easily
acquire “unnatural” syntactic constructions.

A possible explanation is that data-driven
dependency parsers usually lose accuracy in
domains outside the data on which they were
trained, and the enthusiasm generated by the
“underlying semantics” that seems to be assimilated
in this model turns out to be quite fragile.[13]
The reasons for this can be found in the different
distributions of morpho-syntactic features
extracted from the set of sentences in the treebank
which, although numerous, are still
limited in number and typology with respect to
general language. For example, the famous Penn
Treebank corpus (Marcus et al., 1993) is one of
the largest treebanks, but it is dominated by
financial news from the Wall Street Journal, which
contains rather peculiar linguistic phenomena such as
journalistic expressions. More trivially,
state-of-the-art dependency parsers use a highly sparse
lexicalized model: the features are
created using word forms and lemmas (when
available), so that co-occurrences of certain words in
the given treebank are combined into lexicalized
syntactic n-gram features (dependency tree
fragments). Therefore, ideally all possible valid word
combinations that the parser will face during
parsing should be recorded in a treebank, which is
unlikely to happen if we consider both the limited
size of this resource and the fundamental
omniformativity principle of natural language,
according to which languages may express any
learning experience (De Mauro, 1990). On the other
hand, in the treebanks there are syntactically and
distributionally homonymous structures (John Lyons,

[12] Despite this, in the context of Domain Adaptation with
Active Learning, Attardi (2011) uses the score that the parser
itself provides as a useful measure of the perplexity in parsing
a sentence.

[13] How to increase the accuracy of a parsing system when
dealing with out-of-domain texts is the goal of the domain
adaptation task. Usually, techniques such as self-learning or
active-learning (Attardi, 2013) are used.


1970); in other words, the same surface structures
(i.e. pos sequences) can match very different
syntactic analyses, depending on the semantic
relationships among words. For this reason,
homonymous structures cause trouble for a
delexicalized parser, with an accuracy difference between
lexicalized and non-lexicalized parsers greater than
6% on a test set from the same domain as the
training set, showing that lexicalized features seem
indispensable. Basically, lexical information is
important but too sparse. Techniques have recently
been introduced to broaden the spectrum of words
and not to limit feature identity to an exact
word match. Koo and Carreras (2008) improved
parsing accuracy and coverage by replacing
word forms with an approximation of “semantic word
knowledge” in the form of clusters which merge
words according to contextual similarity extracted
from a very large corpus.[14] “Cluster-based”
feature sets have been progressively boosted by recent
distributed word representations (word embeddings),
where each word is represented by a d-dimensional
dense vector, instead of clusters encoded in static
strings (Collobert et al., 2011). As for clusters,
word embeddings are learnt from a large corpus in
such a way that concepts with similar or related
meanings are near each other in that space, i.e.
similar words are expected to have close vectors.[15]
Chen and Manning (2014) developed a
transition-based parser using a deep learning architecture
which exploits word embeddings as features and
also creates dense vector representations for pos
tags and arc labels instead of discrete
representations. Their parser, as well as those of Attardi (2009)
and Grella (at Evalita 2014), takes advantage
of a neural network architecture (multilayer
perceptron) whose hidden layer already incorporates
non-linearity and infers feature interactions starting
from “atomic features” (one feature for each
element property), so that a manually designed
feature template (e.g. pairs or triplets of word
properties), which provides a form of non-linearity useful
for linear classifiers such as the averaged perceptron, is no


[14] In binary representation, clusters serve as coarse lexical
intermediaries and are equivalent to bit-string prefixes,
whose length determines the granularity of the clustering,
e.g. 01 fruit, 010 apple, 011 orange (e.g. Brown’s clustering
algorithm (1992)).

[15] This is possible because word embeddings can be
induced directly from widely available unannotated corpora of
different domains otherwise not covered by traditional
linguistic resources.




longer required.[16]
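
The role of cluster prefixes as coarse lexical intermediaries can be shown with a few lines of code. The sketch below reuses the toy bit strings of the apple/orange example; the function name `cluster_features`, the chosen prefix lengths and the feature naming scheme are assumptions of this example and do not reproduce the feature sets of the cited parsers.

```python
# Toy bit strings: both words share the prefix 01 ("fruit").
CLUSTERS = {"apple": "010", "orange": "011"}


def cluster_features(word, clusters=CLUSTERS, prefix_lengths=(2, 3)):
    """Replace a word form by cluster prefixes of increasing granularity."""
    bits = clusters.get(word)
    if bits is None:
        return ["cluster=UNKNOWN"]
    return [f"cluster[:{n}]={bits[:n]}" for n in prefix_lengths if n <= len(bits)]


apple = cluster_features("apple")
orange = cluster_features("orange")
shared = set(apple) & set(orange)
print(apple, orange, shared)
```

A parser trained on sentences containing only "apple" can thus still fire the shorter-prefix feature when it meets "orange", which is the kind of generalisation that exact word-form features cannot provide.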

These techniques performed well, with an
average improvement of +0.7 with respect to the traditional
lexicalized model; nevertheless, a strong dependence
remains on the context in which the parser has
been trained, and they do not solve at all the
problem of distinguishing, as people do, between
acceptable and unacceptable sentences.

Furthermore, in the case of an ungrammatical
input sentence there is no guarantee that data-driven
transition-based parsers yield the “right” wrong
analysis as the most probable analysis. If we
consider that even with a correct grammatical input it
is not possible to determine a reliable level of
confidence for the output analysis, distinguishing
between correct and incorrect parser results does
not seem feasible.

All this makes it quite evident that the presence
of a grammar (rules or constraints) is essential. In
this direction, as mentioned in a previous section,
a competitive wcdg-parser has been developed for
unrestricted German input that is largely
independent of domain and achieves state-of-the-art
results. The wcdg-parser, thanks to a grammar that
“extended” the notion of grammaticality, is able to
produce an effective score that can be used to
determine the degree of acceptability of a given
analysis, together with an accurate characterisation of
the input text by means of a list of unremovable
soft and hard violated constraints.

Constraint scores also help to guide the parser
towards the optimal solution and allow the parser
to deal with the input in a robust way. However,
since Frobbing is a heuristic procedure, there is no
certainty at the end of the algorithm that the
optimal solution has been found. This means that
it sometimes fails to find the correct dependency
structure of an input sentence even if the language
model (i.e. the entire set of constraints
including lexical information) accurately defines it,
because of search errors during heuristic
optimization. In addition to this, although many defeasible
constraints exist which allow but disprefer certain
constructions, there are many more possible but
implausible dependency structures that are not
dispreferred. A reason for this is that some
possible syntactic constructions are far from
any structural ambiguity perceptible by a human,
and so it is difficult to conceive them before they
[16] A manually designed feature template, by definition,
suffers from the following problems: sparsity, incompleteness,
expensive feature computation.


are computed by the parser. This is part of a
typical issue of any hand-written grammar formalism:
to write rules by hand about every noun or verb
of a language, which seems more and more
necessary once one gets closer to more specific
language phenomena, is simply infeasible.

In any case, the efficiency of wcdg-parsing as it
now stands is clearly not competitive with that of
statistical dependency parsing, with parse times of
several minutes or, even worse, hours in the case
of some complex sentences, making this approach
unusable in real-world contexts.

The matter has now arrived at the point of the
Norvig and Chomsky debate (2011), which recalls the
natural language properties described in section 2.
Norvig suggests that “probabilistic, trained
models are a better model of human language
performance than are categorical, untrained models”,
while Chomsky objects that “It’s true there’s
been a lot of work on trying to apply statistical
models to various linguistic problems. I think
there have been some successes, but a lot of
failures. There is a notion of success ... which I think
is novel in the history of science. It interprets
success as approximating unanalyzed data. (...) We
cannot seriously propose that a child learns the
values of 10^9 parameters in a childhood lasting
only 10^8 seconds.”

With a clear understanding of the limitations
and benefits of both data-driven and constraint-based
approaches, in the next section I briefly introduce
a technique I’m working on to build an
innovative fast hybrid technology for dependency
parsing that combines the strengths of both of these
approaches, with the objective of creating a new
parser that is able to proceed with the analysis in a
“more aware” way.

Before leaving this section, where I tried to explain
the reasons for a new approach, it is important
to notice that, regardless of the approach used,
architectures derived from research in NLP traditionally
separate the process of language analysis
into a series of simpler tasks executed
in sequence, sacrificing the advantages of possible
parallelism, since in many situations it would be
beneficial to exploit information being produced by
one task while performing another task. Moreover,
this architecture may suffer from error propagation
problems, especially when clearly interdependent
tasks are modelled separately, as in the case
of lemmatization, part-of-speech tagging and
syntactic parsing. For instance, a typical model
of syntactic parser presupposes that input words
have been morphologically disambiguated using
first a lemmatizer, then a part-of-speech tagger, before
parsing begins. This is especially harmful for
richly inflected languages such as French, German,
Italian and Spanish, among others (also known
as morphologically rich languages), where there
is a considerable interaction between morphology
and syntax such that neither can be fully disambiguated
without considering the other.

Even the boundary between what is the domain
of syntax and what is the domain of semantics
is very thin: one can just take a look
at the well-known PP-attachment problem, the
correct identification of the several conjuncts involved
in a coordination chain, anaphoric references
and more [7]. Some (yet fundamental) words can
be omitted in the surface structure when they are
to some extent implicit or semantically inferable
(e.g. because of null elements resulting from pro-drop).
All this suggests that syntactic and semantic
analysis should be performed together. For
the resulting parse tree to be complete, the mechanisms
that guarantee this completeness must find
their place and role in the parsing process. These
mechanisms should include trace integration and
word sense disambiguation.

7 Notes about a new approach 

I think that, while a probabilistic component is essential
to resolve ambiguity in both syntax and
semantics, it is also crucial to equip the analysis
system with stable linguistic knowledge. My
premise is that in natural language it seems
possible to distinguish between the possibility and
the probability of a given utterance.

As a matter of fact, the idea of combining a
rule-based linguistic component with a statistical engine
is widely used, especially in machine translation
systems. In a dependency parsing context, a
method has been proposed to add statistical components
as “oracles” to a constraint-based parser:
Foth and Menzel (2006) developed a hybrid version
of the wcdg-parser, which uses a probabilistic
transition-based parser as an initial statistical predictor
component. In their model, the output of the
statistical parser (Nivre 2003) is converted to soft
constraints which encourage the constraint solver
to create first the same dependencies as the statistical
parser, leaving to the Frobbing algorithm
the hard work of finding a better solution only if the
dependencies created by the first system generate
conflicts [8]. They showed that the use of a statistical
parser enables the wcdg-parser to produce
a better initial attachment (which means that less
time has to be spent correcting attachment errors),
and that such a mixed approach helps to reduce
some characteristic issues such as modeling and
search errors, in particular for long and complex
sentences. The reason for this is that correct easy
attachments are rather common, and those that
require deeper analysis are comparatively rare [9].
However, the system remains slow, with sentence
analysis times of the order of seconds.

[7] More generally, a well-known problem in parsing tasks
is dependency ambiguity: for a given sequence A-B-C,
both interpretations A(B(C)) and A((B)(C)) may be structurally
possible.

I suggest a different approach from Foth and
Menzel (2006): in a new parser, the two components
could be used at the same time and not sequentially,
one as a pre- or post-processing step of the
other. My aim is to incorporate linguistic
insights (weighted constraints) into a fast data-driven
transition-based system: the idea is that the
constraints mainly control the possibility of syntactic
expressions (degree of acceptability with respect
to the grammar) and the statistical component uses
their probabilities (i.e. the degree of confidence,
which encodes a sort of language competence according
to the statistical model trained on corpora)
to dynamically guide the parsing process to the
solution. At the beginning I was fascinated by
the human-like behaviour of data-driven transition-based
parsers. Today I think that such parsers are
more useful for navigating the search space, in order
to avoid the evaluation of unlikely solutions, than
heuristic searches such as Frobbing.

[8] A weight of 0.9 has proved to work best in an evaluation
using sentences 501 to 1000 of the NEGRA corpus.

[9] Here the concept of easy attachment is borrowed from
Goldberg and Elhadad (2010), i.e. the arcs that the statistical
predictor can get with high reliability, such as noun-article arcs.

I also think that non-monotonic behavior, obtained
through back-tracking, transformations or
beam search, is fundamental to combining in the
same process several tasks such as pos-tagging and
dependency parsing, exploring a larger space that
best suits the amount of ambiguity to be considered.
In order to mitigate the error-propagation
problem, the new parser could compute a number
of alternative syntactic structures in parallel using
a beam search algorithm (parallel parsers), sharing
information among the different analyses, in contrast
with parsers that compute only a single
preferred analysis (serial parsers). For efficiency
reasons, the beam size may be limited to 10, so
the parser would explore a very small fraction of
the many possible analyses, whose number grows
exponentially. Actually, I’d like to think that in the
human reading process few alternatives are retained
until the ambiguities and uncertainties are
resolved. And this, beyond the obvious reasons
of efficiency, seems to me more reasonable than the
beam size of 64 used by Zhang and Nivre (2011).
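
The beam-limited exploration described above can be sketched as follows. This is only an illustration under stated assumptions: the names `beam_search` and `toy_expand`, the log-score representation and the toy transition scores are hypothetical and are not taken from the parser; only the beam size of 10 comes from the text.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

BEAM_SIZE = 10  # beam width suggested in the text

# A hypothesis is (cumulative log-score, transitions taken so far).
Hypothesis = Tuple[float, Tuple[str, ...]]


def beam_search(
    expand: Callable[[Tuple[str, ...]], Iterable[Tuple[str, float]]],
    steps: int,
    beam_size: int = BEAM_SIZE,
) -> List[Hypothesis]:
    """Keep only the `beam_size` best partial analyses after every step.

    `expand` is a hypothetical callback that, given the transitions taken
    so far, returns the candidate next transitions with their log-scores.
    """
    beam: List[Hypothesis] = [(0.0, ())]
    for _ in range(steps):
        candidates: List[Hypothesis] = []
        for score, history in beam:
            for transition, log_score in expand(history):
                candidates.append((score + log_score, history + (transition,)))
        if not candidates:
            break
        # Retain the highest-scoring hypotheses; all others are pruned.
        beam = heapq.nlargest(beam_size, candidates, key=lambda h: h[0])
    return beam


def toy_expand(history: Tuple[str, ...]) -> List[Tuple[str, float]]:
    # Toy model with fixed log-scores, for illustration only.
    return [("shift", -0.1), ("arc-left", -0.7), ("arc-right", -1.2)]
```

With three candidate transitions per step, four steps would yield 81 complete transition sequences, while the beam never holds more than 10 of them at a time.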

Despite the use of beam search, no proper
analysis can be found for difficult or ill-formed
constructs that always violate a hard constraint.
To handle this, in the new approach I introduced
the concept of the unknown arc, a dependency
label used if the best syntactic connection found
continues to violate a hard constraint and therefore
there is no theoretical and formal justification
for specifying another label. A level of analysis following
syntactic parsing can then take this insight into account [10].
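
A minimal sketch of the unknown arc fallback, assuming that candidate arcs arrive as scored tuples and that hard constraints are exposed through a single predicate. The names `label_or_unknown` and `violates_hard_constraint`, and the tuple layout, are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

UNKNOWN = "unknown"  # fallback label described in the text

# A candidate arc is (governor index, dependent index, label, score).
Arc = Tuple[int, int, str, float]


def label_or_unknown(
    candidates: List[Arc],
    violates_hard_constraint: Callable[[Arc], bool],
) -> Optional[Arc]:
    """Return the best admissible arc, or an `unknown` arc if none is admissible."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda arc: arc[3], reverse=True)
    for arc in ranked:
        if not violates_hard_constraint(arc):
            return arc
    # Every labelled attachment violates a hard constraint: keep the best
    # structural connection but do not commit to a specific label.
    governor, dependent, _, score = ranked[0]
    return (governor, dependent, UNKNOWN, score)
```

A later processing level can then recognise the `unknown` label and treat the attachment as one that the grammar could not justify.
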

Among the advantages, the new hybrid approach
allows the use of constraints with a high
level of abstraction. A subset of known universal
linguistic knowledge has been successfully used in unsupervised
dependency parsing, achieving state-of-the-art
accuracy in that context (Naseem, 2010).
This knowledge may be converted into constraints
that could be used to control the execution of a
transition given by the classifier [11]. By means of
general universal constraints it becomes easy to filter
out some implausible constructions proposed by the
statistical component, e.g. the maximum degree of
multiple center-embedding of clauses is exactly 3
in written language (Karlsson, 2007) [12].

[10] This is different from the dep Stanford dependency, which is
used when the system (or a human) is unable to determine a
more precise dependency relation between two words.

[11] Some parameters used by universal constraints (e.g.
dominant subject, verb, object sequence order) should be
learned during a first reading of the treebank.

[12] No real examples of degree 4 have been recorded. In
spoken language, multiple center-embeddings even of degree
2 are so rare as to be practically non-existent (Karlsson,
2007).

In practice, in my experiments the data-driven
transition-based component is based on the arc-standard
model (Nivre, 2004), extended with:

a. a wait transition (Yamada and Matsumoto,
2003) used to create hypotheses (which become part
of the features) about some “delayed dependencies”
during a shift, due to a pure bottom-up
arc-standard strategy, approximating the behavior
of the arc-right transition in the arc-eager
model (Nivre, 2003), which creates arcs
in a more incremental way, in line with the
psycholinguistic point of view that postulates that
humans tend to make predictions about syntactic
structure and process local attachments
first (Gibson, 2000);

b. the non-adjacency transitions proposed by Attardi
(2006) to handle non-projective dependency
structures while maintaining linear time complexity.
Even if the non-adjacency transitions
have an incomplete coverage of non-projective
structures, Attardi (2006) notes
that a distance lower than 3 is sufficient to
handle almost all cases of non-projectivity
in the training data of almost all languages.
This model takes advantage of a great intuition
of Attardi (2014): to extend the use of
non-adjacent arc transitions, used so far only
to handle non-projectivity, also to recovering
an overlooked proper arc;

c. a specific transition that considers the top node
from the second to the last configuration,
following the None approach discussed by
Ballesteros and Nivre (2012). In their model,
if no dummy root node is added (at the beginning
or at the end of a sentence), then
there is no explicit transition to link the top
node, meaning that the last node remaining
in a configuration will be treated as the root
dependent. By contrast, in this model a specific
root transition is used to assign the top label
to the top node. As a matter of fact, only when
the parser sees the whole tree can it verify the
“integrity” of the solution;

d. a special mechanism integrated with arc transitions
to treat punctuation, in a way similar
to Ma et al. (2013). Traditional transition-based
parsers treat punctuation marks in the same way
as words, although they are not as consistently
annotated in treebanks as words, making
them harder to parse. In the new parser, before
the creation of the initial configuration,
some punctuation marks (e.g. commas) are first
attached as properties of their right neighbouring
words, then removed from the input
sentence. Such information is propagated
from dependent to governor step by step and
used as features;



e. an enrichment of the information encoded into
transitions, in a way similar to Bohnet and
Nivre (2012), who introduce a transition-based
system that jointly performs pos-tagging
and dependency parsing by encoding pos-tag
information in the shift transition. The
new technique I’m proposing here consists
in enriching the arc transitions (e.g. arc-left,
arc-right) by adding the pos-tag to the syntactic
label (deprel), so the parser can easily
perform pos-tagging and syntactic analysis
with a single transition. Using this technique,
where dependency labels natively consist of a
pos-deprel combination (e.g. noun-subj, adj-rmod),
words may have more than one deprel
in order to handle “agglutinated words”
without a preprocessing task [13].

Technical details, necessary to explain these extensions
as well as the subsequent notes, will be the
subject of a future paper.
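
As an illustration of the punctuation treatment in extension d, the preprocessing step could look like the sketch below. It assumes that only commas are folded into the following word and that marks with no right neighbour are kept as ordinary tokens; `Token`, `left_punct` and `fold_punctuation` are hypothetical names, and the Italian example sentence is invented.

```python
from dataclasses import dataclass, field
from typing import List, Set

ATTACHABLE = {","}  # assumption: only commas are folded into the next word


@dataclass
class Token:
    form: str
    # Punctuation marks that immediately preceded this token in the input.
    left_punct: List[str] = field(default_factory=list)


def fold_punctuation(forms: List[str], attachable: Set[str] = ATTACHABLE) -> List[Token]:
    """Attach selected punctuation to the right neighbour and drop it from the input."""
    tokens: List[Token] = []
    pending: List[str] = []
    for form in forms:
        if form in attachable:
            pending.append(form)
        else:
            tokens.append(Token(form, pending))
            pending = []
    # Marks without a right neighbour are kept as ordinary tokens in this sketch.
    tokens.extend(Token(mark) for mark in pending)
    return tokens


sentence = ["Maria", ",", "che", "dorme", ",", "sogna", "."]
folded = fold_punctuation(sentence)
```

Here `folded` contains five tokens, and the tokens for "che" and "sogna" each record a preceding comma that the statistical component could read as a feature.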

The application of constraints occurs whenever
the statistical transition system proposes an arc
creation.

As a matter of fact, that is exactly the moment
at which traditional transition-based dependency
parsers already impose certain constraints:
not all transitions predicted by the machine learning
algorithm are valid at each configuration, due
to preconditions in the transition system, so before
execution only the valid transitions are
sorted and the best one is executed. Nivre, Goldberg
and McDonald (2014) extend the transitions
of the arc-eager model with preconditions for different
constraints, in order to block some of them
according to some fixed criteria. In their empirical
case studies they consider the problem of parsing
commands to personal assistants such as Siri or
Google Now. In this specific context it is plausible
that if the first word of a command is a verb,
it is likely the root of the sentence. Using a simple
constraint such as “the first word of the sentence must
be the root” they achieved an important accuracy
improvement (over 3%), demonstrating the
effectiveness of using additional information
sources not directly inferable at training time.
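
The filtering of predicted transitions through preconditions can be sketched as below. Note the assumptions: the sketch uses a simplified arc-standard stack rather than the arc-eager system of Nivre et al. (2014), it models only the stack, and all function names are hypothetical.

```python
from typing import Callable, Dict, List

# A configuration is reduced here to the stack of word indices (1-based).
Config = Dict[str, List[int]]
Precondition = Callable[[Config, str], bool]


def best_valid_transition(
    scores: Dict[str, float],
    config: Config,
    preconditions: List[Precondition],
) -> str:
    """Pick the highest-scoring transition that satisfies every precondition."""
    valid = [
        (score, name)
        for name, score in scores.items()
        if all(check(config, name) for check in preconditions)
    ]
    if not valid:
        raise ValueError("no valid transition in this configuration")
    return max(valid, key=lambda item: item[0])[1]


def stack_has_two_items(config: Config, transition: str) -> bool:
    # Transition-system precondition: an arc needs two items on the stack.
    if transition.startswith("arc-"):
        return len(config["stack"]) >= 2
    return True


def first_word_is_root(config: Config, transition: str) -> bool:
    # External constraint: word 1 may never become a dependent.
    stack = config["stack"]
    if len(stack) < 2:
        return True
    if transition == "arc-left":   # second-top becomes dependent of top
        return stack[-2] != 1
    if transition == "arc-right":  # top becomes dependent of second-top
        return stack[-1] != 1
    return True
```

With the stack `[1, 2]` and a classifier that prefers arc-left, adding `first_word_is_root` to the preconditions blocks that transition, so the next best valid one is executed instead.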

[13] For example, in the Italian language the word della
(di+la) may have a deprel prep-conn art-det, and the word
leccala (leccare+la) may have a deprel verb-top pron-dobj.

[14] Grella et al. (2011), during the workshop of Evalita 2011, in
http://www.evalita.it/sites/evalita.fbk.eu/


The new parser uses the pos-tag information encoded
in the transition as a precondition, in a way
similar to Nivre et al. (2014). For each arc transition,
it is able to check whether the set of all the possible
available lexical readings of the target, the
dependent node, contains the desired pos-tag [15]. If
so, only compatible lexical readings are maintained
(the parser starts with all possible lexical readings
for each word), otherwise the transition is rejected.
This mechanism (which is an essential part of the
online morphological disambiguation process) ensures
that the dependent is morphologically valid,
but nothing can be inferred about the validity of
the governor with respect to its new dependent,
and hence about the consistency of the dependency
relation just created [16]. However, the more dependents
a governor collects, the more the chances
increase that a constraint could help to disambiguate
the governor’s lexical readings as well. For instance,
the constraint “verb-aux depends only on
a verb” restricts the possible readings of the governor,
even constraining the realization of the dependency
relation itself or reducing its reliability
score.
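
A minimal sketch of the pos-tag precondition on the dependent, assuming that each lexical reading is a (pos, lemma) pair and that an arc label has the form pos-deprel. The names `split_label` and `apply_arc` are hypothetical, the example word is invented, and multi-label agglutinated words are not handled here.

```python
from typing import Dict, List, Optional, Tuple

# Each lexical reading is a (pos, lemma) pair; a word starts with all of them.
Reading = Tuple[str, str]


def split_label(label: str) -> Tuple[str, str]:
    """Split a combined arc label such as 'noun-subj' into (pos, deprel)."""
    pos, deprel = label.split("-", 1)
    return pos, deprel


def apply_arc(
    readings: Dict[int, List[Reading]],
    dependent: int,
    label: str,
) -> Optional[str]:
    """Check the pos-tag precondition and prune the dependent's readings.

    Returns the deprel if the transition is accepted, None if it is rejected.
    """
    pos, deprel = split_label(label)
    compatible = [r for r in readings[dependent] if r[0] == pos]
    if not compatible:
        return None  # the dependent has no reading with the required pos-tag
    readings[dependent] = compatible  # online morphological disambiguation
    return deprel


# Italian "porta" is ambiguous between a noun and a form of the verb "portare".
readings = {1: [("noun", "porta"), ("verb", "portare")]}
rejected = apply_arc(readings, 1, "adj-rmod")
accepted = apply_arc(readings, 1, "noun-subj")
```

The rejected transition leaves the readings untouched, while the accepted one keeps only the noun reading of the dependent; the governor's readings are not affected at this stage.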

In this model the constraints can act on several
levels of analysis, and they are not only able to
filter (passive checks) but also capable of generating
new information during the analysis process,
which in turn may be used by the statistical component.

files/presentations2011/Grella.pdf
present a technique similar to the arc constraints of Nivre et al.
(2014), called multilayer linguistic supervision, which uses
subcategorization information (e.g. transitivity) in order to
block meaningless transitions.

[15] The parser has to include a morphological analyzer, for
example using a dictionary of word forms and multi-word
expressions, with associated PoS, lemmas and grammatical
features (e.g. mood, tense, person, gender, number, case).

[16] This is more complicated in practice, since, starting from
governor and dependent, the disambiguation process (i.e.
the elimination of non-valid lexical readings) could involve
nodes at an arbitrary depth.

[17] Boolean result means that a constraint is passed or violated.
In the latter case it also returns the value of the penalty.

In addition to the expected boolean result of the
application of a constraint, a feature transportation
can also be obtained [17]. Feature transportation
refers to the process of propagating some kind
of information along the dependency tree. This
mechanism is required to describe natural language
phenomena where constraining information
is applied at a particular node but originates from
a structurally distant one. Since the distance may
be of arbitrary length, the useful information is
hardly contained in the set of local features which
are visible to the statistical component, so it is
important for a node to inherit properties from another
node. Feature transportation could be viewed
as a sort of “unification” mechanism, widely used
in unification-based grammars [18].

Among other cases, in the new parser this mechanism
is used for:

• determiners: a determiner may impose a
gender, a number and a grammatical case on
its governor;

• auxiliary verbs: an auxiliary verb may amend
mood, tense and voice properties (active vs.
passive construction);

• coordination chains: the presence of a coordinate
conjunct may create new morphological
properties on the regent node, taking
into account the properties of all the nodes
involved in the coordination chain. These
new properties are called tree-gender, tree-number
and tree-person and may be used in
order to establish some required future agreement
(e.g. subject-predicate agreement) [19].
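
The coordination case above can be sketched as follows. The resolution rules used here (plural number for two or more conjuncts, feminine only if all conjuncts are feminine, lowest person wins) are assumptions typical of Italian-style agreement and are not stated in the text; `Morph` and `transport_coordination` are hypothetical names, and appositive coordination is not handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morph:
    gender: str  # "m" or "f"
    number: str  # "sing" or "plur"
    person: int  # 1, 2 or 3


def transport_coordination(conjuncts: List[Morph]) -> Morph:
    """Compute tree-gender, tree-number and tree-person for a coordination head."""
    if not conjuncts:
        raise ValueError("a coordination chain needs at least one conjunct")
    if len(conjuncts) == 1:
        return conjuncts[0]
    return Morph(
        # Assumed rule: feminine only if every conjunct is feminine.
        gender="f" if all(c.gender == "f" for c in conjuncts) else "m",
        # Assumed rule: two or more conjuncts behave as a plural.
        number="plur",
        # Assumed rule: first person outranks second, second outranks third.
        person=min(c.person for c in conjuncts),
    )


# Two singular third-person conjuncts, one feminine and one masculine.
tree = transport_coordination([Morph("f", "sing", 3), Morph("m", "sing", 3)])
```

The transported properties in `tree` (masculine, plural, third person) are what a later subject-predicate agreement constraint would be checked against, rather than the properties of any single conjunct.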

In the new hybrid approach, the feature transportation
mechanism doesn’t require any approximation
of high-order constraints (McCrae et al.,
2008), which are instead indispensable in the original
WCDG approach. The reason for this is that
in wcdg-parsing the heuristic algorithm needs to be
informed of the structural changes (including node
properties resulting from a transport) in order to perform
proper transformations in case of conflicts, and the
complexity grows polynomially in the number of
considered nodes [20]. In the new approach the analysis
process is mainly data-driven, exploiting the
statistical language model, and the rich information
resulting from the application of constraints can be
used for at least two different purposes:

1. by the search algorithm, as a “grip” to recover
from wrong transitions, for example through the
optimistic back tracking technique (Ytrestøl,
2011);

2. as additional features for the statistical
component, for example to inform the parser
that two focus words are particularly “connected”,
also predicting early a possible dependency
label, or that, on the contrary, they
don’t share any syntactic or semantic relation.

[18] See Shieber (2003) for an introduction to unification-based
approaches to grammar.

[19] The constraint may be fine-tuned to handle the special
case of appositive coordination, which does not imply plural
number.

[20] In WCDG, the restriction to binary constraints (i.e. dependency
relations between two nodes), motivated by computational
issues, is a severe limitation of the expressiveness
of the formalism.

As for the features used in the statistical
component, I believe that in order to obtain
a robust parser they should be almost entirely delexicalized,
except for function words, where
the lemma can be viewed as fine-grained part-of-speech
information [21]. This is also necessary
because of the number of lexical ambiguities, which does not
easily allow a discrete feature representation. The
new parser performs syntactic and morphological
analysis together, at the same time, so during
the parsing process the words not yet fully analyzed
keep all their possible lexical readings alive,
creating, for instance, several features with different
pos-tag candidates for each word.

Therefore, instead of using sparse form
or lemma features, which represent “semantic
instances”, in the new parser I found it more appropriate
to use scores of semantic relations obtained
by exploiting a large vectorial semantic space
that records syntactic dependencies (e.g. subj-verb,
dobj-verb, pp-attachment). This structured
knowledge could be generated using the learning
by reading technique, where a basic parser reads
large quantities of text in order to create information
for the second parser [22]. By also using stable
semantic knowledge, it is possible to improve syntactic
parsing following some specific insights of
Christen (2013).

Thanks to the active use of the constraints, the
new parser is able to insert traces in a bottom-up
way, in particular for missing arguments (e.g. for
modal verbs). The mechanism of traces has been
extended to cover more complex phenomena, such
as gaps (e.g. where a verb is missing, as in
a coordination). After the integration of traces, the
parser uses the same vector semantic space to resolve
some anaphoric references.


[21] Unlexicalized parsing is also considered to be robust for
applications such as cross-lingual parsing (McDonald et al.,
2011).

[22] To encode syntactic dependencies, instead of word order,
I used the technique described in Basile (2011).




8 Conclusion 

In this paper I explained the reasons that led me to
research and conceive a novel technology
for dependency parsing. In particular I highlighted
the problem of inferring the reliability of the results
of a data-driven transition-based parser, which is
extremely important for high-level processes that
expect to use correct parsing results.

In my research I deeply investigated the use
of different kinds of “knowledge” (e.g. syntactic
constraints, rich morphology, punctuation, null
elements, semantics, linguistic universals, temporal
and spatial dimensions) in the same syntactic
analysis, and identified a number of areas of possible
action. I then briefly introduced a number of
notes about a new hybrid approach for dependency
parsing that combines data-driven transition-based
and constraint-based approaches. The result is a
parser that proceeds with the analysis in a “more
aware” way, that attempts to understand when it
fails to analyze, maintaining a robust behavior and
high efficiency. The aim is that a parser accepts
not only “well-formed” sentences but also deviant
structures if no other analysis is feasible.

I realised that a perfect interoperability between
software and data (i.e. treebank, dictionaries and
rules) is crucial [23]. I have encoded (a subset of) the Italian
lexicon in a formalism inspired by the
Slot Grammar (SG) of McCord (1980), wherein
every lexical entry contains information about category,
morphological features and a set of slots
and rules for filling them (e.g. word order, agreement
information). This lexicon includes subcategorization
for nouns, adjectives, verbs and adverbs.
A great effort has been required to define
the grammar of constraints. In my research activity
I could take advantage of the Lisp implementation
of a particular rule-based parser developed
by Lesmo (2009), which got the best result (LAS
88.73%) at Evalita 2009 DPT, and of Lector, developed
by Christen (1990), from which I extracted
some rules that I then transformed into a set
of WCDG constraints [24]. All the constraints have
then been fine-tuned on the TUT Treebank and
were tested against Evalita 2011 DPT, obtaining
an attachment score of 96.16%, the best result so
far for a dependency parser for the Italian language.

[23] Let me give you an analogy: it is known that Apple’s
superiority is due to its perfect hardware and software integration.

[24] See Christen (2013) for more information about a new
version of Lector, called Syntagma, which implements linguistic
constraints in a way similar to Property Grammar
(Blache, 2006).

Because of the intimate relationship between
pos-tag and dependency label used in the new
parser, I developed a special treebank that consists
of 6,515 Italian sentences and 108,973 words,
aligned with the syntactic relations described in
the constraints and with the lexical readings resulting
from the morphological analyzer. This treebank
and the constraints, as well as the criteria for
establishing dependency relations, were developed
with Tesnière’s four fundamental categories
(1959) in mind, in which only semantically full
words are allowed to behave as governors, that is:
verbs, nouns, adjectives, adverbs. Therefore, the
parser output is closer to the collapsed version of
Stanford dependencies, where, for example, a prep
can’t be a governor, so it is natively more suitable
for Information Extraction.

Some resources I developed during these research
activities, including the Italian Treebank,
are freely available on my GitHub repository [25].

I like to believe that these few notes of mine
could be a starting point for new research in the
field of dependency parsing.